Author: Ray Wu
First, we load the gapminder and tidyverse packages:
library(gapminder)
library(tidyverse)
## Note: the specification for S3 class "difftime" in package 'lubridate' seems equivalent to one from package 'hms': not turning on duplicate class definitions for this class.
## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.6
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Next, let’s take a look at the dataset:
gapminder
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ... with 1,694 more rows
Our goal is to drop the ‘Oceania’ level from the ‘continent’ factor.
Let’s first look at the rows that will be removed when we drop Oceania:
gapminder %>%
filter(continent == 'Oceania')
## # A tibble: 24 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Australia Oceania 1952 69.1 8691212 10040.
## 2 Australia Oceania 1957 70.3 9712569 10950.
## 3 Australia Oceania 1962 70.9 10794968 12217.
## 4 Australia Oceania 1967 71.1 11872264 14526.
## 5 Australia Oceania 1972 71.9 13177000 16789.
## 6 Australia Oceania 1977 73.5 14074100 18334.
## 7 Australia Oceania 1982 74.7 15184200 19477.
## 8 Australia Oceania 1987 76.3 16257249 21889.
## 9 Australia Oceania 1992 77.6 17481977 23425.
## 10 Australia Oceania 1997 78.8 18565243 26998.
## # ... with 14 more rows
Since Oceania has 24 rows and each country has 12 years of data, dropping Oceania should remove 24 rows, or equivalently 2 countries.
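The arithmetic above can be checked directly. This is a small sketch using base R's subset() on the gapminder data; the variable name oceania is only for illustration.

```r
library(gapminder)

# Rows belonging to Oceania (the ones that will be removed)
oceania <- subset(gapminder, continent == "Oceania")

nrow(oceania)                     # 24 rows
length(unique(oceania$country))   # 2 distinct countries
nrow(gapminder) - nrow(oceania)   # 1680 rows expected after filtering
```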
We can also get more information about the dataset as follows:
gapminder %>%
str()
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
(gapminder_no_oceania = gapminder %>%
filter(continent != 'Oceania'))
## # A tibble: 1,680 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ... with 1,670 more rows
Let’s check the modified factor:
gapminder_no_oceania$continent %>%
levels()
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
We still have Oceania! filter() removes the rows but keeps the unused level in the factor. We need to call the droplevels() function to actually drop Oceania.
(gapminder_no_oceania = gapminder_no_oceania %>%
droplevels())
## # A tibble: 1,680 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ... with 1,670 more rows
gapminder_no_oceania$continent %>%
levels()
## [1] "Africa" "Americas" "Asia" "Europe"
Now, we see that Oceania is actually gone for good.
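The same behaviour can be reproduced on a tiny factor, which makes it easier to see what is going on. The vector below is hypothetical toy data standing in for the continent column.

```r
# Toy factor standing in for gapminder$continent (hypothetical data)
continent <- factor(c("Asia", "Europe", "Oceania", "Asia"))

# Subsetting removes the values but keeps every level
kept <- continent[continent != "Oceania"]
levels(kept)              # "Asia" "Europe" "Oceania"
table(kept)               # Oceania is still listed, with a count of 0

# droplevels() removes levels that no longer occur
levels(droplevels(kept))  # "Asia" "Europe"
```

When only one column should be cleaned up, forcats::fct_drop() does the same job for a single factor.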
Next, we create a smaller version of the dataset (filtered down to the data from 2002), which we will use to reorder factors and to read from and write to disk.
gapminder_2002 = gapminder %>%
filter(year == 2002)
(gapminder_asia_2002 = gapminder_2002 %>%
filter(continent == 'Asia'))
## # A tibble: 33 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2002 42.1 25268405 727.
## 2 Bahrain Asia 2002 74.8 656397 23404.
## 3 Bangladesh Asia 2002 62.0 135656790 1136.
## 4 Cambodia Asia 2002 56.8 12926707 896.
## 5 China Asia 2002 72.0 1280400000 3119.
## 6 Hong Kong, China Asia 2002 81.5 6762476 30209.
## 7 India Asia 2002 62.9 1034172547 1747.
## 8 Indonesia Asia 2002 68.6 211060000 2874.
## 9 Iran Asia 2002 69.5 66907826 9241.
## 10 Iraq Asia 2002 57.0 24001816 4391.
## # ... with 23 more rows
First, let’s plot the data before reordering the factor levels:
gapminder_asia_2002 %>%
ggplot(aes(pop, country)) +
geom_point() +
scale_x_log10() +
ggtitle('log(Population) of Asian Countries, 2002')
It’s pretty difficult to get any sense of ordering from this graph.
Now we will reorder the levels and re-make this plot:
gapminder_asia_2002 %>%
mutate(country = fct_reorder(country, pop, .fun=median)) %>%
ggplot(aes(pop, country)) +
geom_point() +
scale_x_log10() +
ggtitle('log(Population) of Asian Countries, 2002')
This is clearly a much better graph: it lets us see both the extreme points and the overall distribution much more easily.
It seems that we should be able to do the same thing with arrange(). After all, we are only sorting the data before plotting.
gapminder_asia_2002 %>%
arrange(pop) %>%
ggplot(aes(pop, country)) +
geom_point() +
scale_x_log10() +
ggtitle('log(Population) of Asian Countries, 2002')
This does not work because arrange() only changes the order of the rows in the table. The plot is based on the levels of the country factor, which are unchanged, so the categories are still plotted alphabetically.
fct_reorder(), on the other hand, reorders the levels of the factor by population. Hence, the plot with fct_reorder() is different: the first level (drawn at the bottom of the axis) is now the country with the smallest population, instead of the first country alphabetically.
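The difference between sorting rows and reordering levels is easy to see on a toy data frame. This is only a sketch; the three countries and their populations are made up.

```r
library(dplyr)
library(forcats)

# Hypothetical toy data: levels default to alphabetical order A, B, C
toy <- data.frame(
  country = factor(c("A", "B", "C")),
  pop     = c(300, 100, 200)
)

# arrange() sorts the rows, but the level order is untouched
sorted_rows <- arrange(toy, pop)
levels(sorted_rows$country)   # "A" "B" "C"

# fct_reorder() changes the level order (ascending pop by default)
reordered <- mutate(toy, country = fct_reorder(country, pop))
levels(reordered$country)     # "B" "C" "A"
```

Since ggplot2 places categories along a discrete axis in level order, only the second version changes the plot.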
We will demonstrate file i/o with the gapminder_2002 data frame.
write_csv(gapminder_2002, 'gapminder_2002.csv')
Confirm that the file exists:
list.files(pattern = "gapminder_2002.csv")
## [1] "gapminder_2002.csv"
We see that the file gapminder_2002.csv exists, so we know that write_csv worked as intended.
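Note that the pattern argument of list.files() is a regular expression, so the dot matches any character. A more direct check is file.exists(), sketched here on a hypothetical temporary file rather than on the file written above.

```r
# Hypothetical temporary file, used only for this illustration
path <- file.path(tempdir(), "example.csv")
write.csv(data.frame(x = 1:3), path, row.names = FALSE)

# Returns TRUE or FALSE for exactly this path, with no pattern matching
csv_exists <- file.exists(path)
csv_exists
```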
read_data = read_csv('gapminder_2002.csv')
## Parsed with column specification:
## cols(
## country = col_character(),
## continent = col_character(),
## year = col_integer(),
## lifeExp = col_double(),
## pop = col_integer(),
## gdpPercap = col_double()
## )
read_data %>% str()
## Classes 'tbl_df', 'tbl' and 'data.frame': 142 obs. of 6 variables:
## $ country : chr "Afghanistan" "Albania" "Algeria" "Angola" ...
## $ continent: chr "Asia" "Europe" "Africa" "Africa" ...
## $ year : int 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 ...
## $ lifeExp : num 42.1 75.7 71 41 74.3 ...
## $ pop : int 25268405 3508512 31287142 10866106 38331121 19546792 8148312 656397 135656790 10311970 ...
## $ gdpPercap: num 727 4604 5288 2773 8798 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 6
## .. ..$ country : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ continent: list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ year : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ lifeExp : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ pop : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ gdpPercap: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
We don’t see a factor anywhere: country and continent come back as character vectors. A CSV file stores only text, so factors are not preserved when writing to it. We will see a better method to do this in the next section.
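If the data has to stay in CSV, one possible workaround (not part of the original analysis) is to tell read_csv() which columns should be parsed as factors. A minimal sketch on a hypothetical toy file:

```r
library(readr)

# Hypothetical toy data written to a temporary file
path <- tempfile(fileext = ".csv")
toy <- data.frame(
  country   = factor(c("Japan", "Kenya")),
  continent = factor(c("Asia", "Africa"))
)
write_csv(toy, path)

# Ask readr to parse both columns as factors;
# levels = NULL means the levels are discovered from the data
restored <- read_csv(
  path,
  col_types = cols(
    country   = col_factor(levels = NULL),
    continent = col_factor(levels = NULL)
  )
)

is.factor(restored$continent)
levels(restored$continent)   # levels in order of first appearance
```

The level order is taken from the data rather than from the original factor, so any custom ordering would still be lost.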
gapminder_2002 %>% saveRDS('gapminder_2002.rds')
Check to make sure that the file exists:
list.files(pattern = "gapminder_2002.rds")
## [1] "gapminder_2002.rds"
As expected, the file exists.
Now, read in the file again:
rds_file = readRDS('gapminder_2002.rds')
No errors! That’s a good start. Now let’s check the structure of the dataset:
rds_file %>% str()
## Classes 'tbl_df', 'tbl' and 'data.frame': 142 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 5 4 3 3 4 ...
## $ year : int 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 ...
## $ lifeExp : num 42.1 75.7 71 41 74.3 ...
## $ pop : int 25268405 3508512 31287142 10866106 38331121 19546792 8148312 656397 135656790 10311970 ...
## $ gdpPercap: num 727 4604 5288 2773 8798 ...
As expected, we do not encounter any problems with reading in the .rds file. In particular, we note that country and continent are factors as expected.
(Note that gapminder_2002 was filtered from the original data frame by year only, so it still has 142 countries and 5 continents.)
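The round trip through saveRDS() and readRDS() can also be checked programmatically. This sketch uses a hypothetical toy data frame with a deliberately non-alphabetical level order, to show that the order survives as well.

```r
# Hypothetical toy data: the level order is set by hand
toy <- data.frame(
  size = factor(c("small", "large"), levels = c("small", "large"))
)

path <- tempfile(fileext = ".rds")
saveRDS(toy, path)
restored <- readRDS(path)

identical(toy, restored)   # TRUE: the object comes back unchanged
levels(restored$size)      # "small" "large"
```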
I am going to remake a plot that I handed in for assignment 2. Here is the original:
ggplot(gapminder, aes(continent)) +
geom_bar(fill = 'dark green')
Let’s see how we can improve this:
- The count on the y-axis is unclear. It looks like a number of countries, but geom_bar() counts rows, so it is actually the number of country-year entries. This is not apparent from the axis.
- We should give the plot a title.
- Entries from different years are all mixed together, so it is hard to imagine this being useful.
- It could be more colourful, although the current scheme is readable.
Instead, I am going to do the following:
- contrast the total population of the 5 continents
- express these values as percentages, to make it easy to see which continents have increased or decreased their share of world population
- give the graph a meaningful title
- use colour (or some other meaningful encoding) to contrast the change
- separate the years
First, let’s calculate the total population for each continent in each year:
plot_data = gapminder %>%
group_by(continent,year) %>%
summarize(totalPop = sum(as.double(pop))) # convert to double first to prevent integer overflow
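To see why the as.double() conversion matters: pop is stored as an integer, and R integers are 32-bit, so a sum above roughly 2.1 billion cannot be represented. A minimal sketch with two made-up values:

```r
# Two integer values whose sum exceeds the 32-bit integer range
big <- c(.Machine$integer.max, 1L)

# The integer sum overflows: R warns and returns NA
overflowed <- suppressWarnings(sum(big))
is.na(overflowed)      # TRUE

# Converting to double first gives the correct total
sum(as.double(big))    # 2147483648
```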
We also need each continent’s share of the world population in each year. The following code block groups by year, so that sum(totalPop) is the world population for that year:
plot_data = plot_data %>%
group_by(year) %>%
mutate(popRatio = totalPop/sum(totalPop))
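The grouped mutate() is easier to follow on a small example. The data frame below is hypothetical; the column names match the ones used above.

```r
library(dplyr)

# Hypothetical totals for two continents in two years
toy <- data.frame(
  continent = c("X", "Y", "X", "Y"),
  year      = c(2000, 2000, 2005, 2005),
  totalPop  = c(30, 70, 50, 150)
)

# Within each year, divide by that year's total
shares <- toy %>%
  group_by(year) %>%
  mutate(popRatio = totalPop / sum(totalPop)) %>%
  ungroup()

shares$popRatio   # 0.30 0.70 0.25 0.75
```

Within each year the ratios add up to 1, which is what makes a stacked plot of them fill the whole height.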
Finally, we generate a stacked-area graph, which allows us to accurately visualize the proportion of categories over time; in this case, it is the continents and how their population progresses as a proportion of the world population.
(improved_graph = plot_data %>%
ggplot(aes(year, popRatio, fill=continent)) +
geom_area(position = 'stack') +
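xlab('Year') +
scale_y_continuous(labels = scales::percent) + # popRatio is a 0-1 proportion; show it as a percentage
ylab('Percentage of world population') +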
ggtitle('Proportion of world population in continents over time'))
We can see from this graph that Asia has the majority of the world’s population, and that the Americas’ share hasn’t changed much in the last 60 years or so. Africa’s share has increased and Europe’s has decreased. Oceania has always accounted for only a very small share.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
improved_graph_plotly = improved_graph %>% ggplotly()
Let’s take a look at the new plot:
improved_graph_plotly
The most distinctive thing about the plotly graph is interactivity: I can hover my mouse over a data point and read off its value. This is useful for people working with the Rmd or its HTML output, but not for publishing graphs in papers, because such a feature is not possible on paper or in a PDF.
Also, the interactive graph cannot be checked on GitHub, because the report is rendered there as Markdown (.md), which does not display it.
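One possible way to share the interactive version separately is to save it as a standalone HTML file with htmlwidgets::saveWidget(). This was not done in the original report; the sketch below uses a stand-in plot built from mtcars and a hypothetical output file name.

```r
library(ggplot2)
library(plotly)

# Stand-in plot; in this report it would be improved_graph
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
widget <- ggplotly(p)

# Hypothetical output path; selfcontained = FALSE avoids the need for pandoc
# and writes the supporting files to a folder next to the HTML file
out <- file.path(tempdir(), "interactive_plot.html")
htmlwidgets::saveWidget(widget, out, selfcontained = FALSE)

file.exists(out)
```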
ggsave('pop_prop_time.png', plot = improved_graph)
## Saving 7 x 5 in image
Figure: Graph of population proportions over time (pop_prop_time.png)
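The "Saving 7 x 5 in image" message appears because ggsave() falls back to the size of the current graphics device when no size is given. Setting the size explicitly makes the output reproducible. A sketch with a stand-in plot and a hypothetical file name:

```r
library(ggplot2)

# Stand-in plot; in this report it would be improved_graph
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# Explicit width, height and resolution, so the result does not depend
# on the size of the plotting window
out <- file.path(tempdir(), "example_plot.png")
ggsave(out, plot = p, width = 7, height = 5, units = "in", dpi = 300)

file.exists(out)
```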